AdductHunter: identifying protein-metal complex adducts in mass spectra

Mass spectrometry (MS) is an analytical technique for molecule identification that can be used for investigating protein-metal complex interactions. Once the MS data is collected, the mass spectra are usually interpreted manually to identify the adducts formed as a result of the interactions between proteins and metal-based species. However, with increasing resolution, dataset size, and species complexity, the time required to identify adducts and the error-prone nature of manual assignment have become limiting factors in MS analysis. AdductHunter is a open-source web-based analysis tool that automates the peak identification process using constraint integer optimization to find feasible combinations of protein and fragments, and dynamic time warping to calculate the dissimilarity between the theoretical isotope pattern of a species and its experimental isotope peak distribution. Empirical evaluation on a collection of 22 unique MS datasetsshows fast and accurate identification of protein-metal complex adducts in deconvoluted mass spectra. Supplementary Information The online version contains supplementary material available at 10.1186/s13321-023-00797-7.


Introduction
Mass spectrometry (MS) is a well-established analytical technique for chemical identification and molecular weight determination of various analytes [1].The experimental output is a mass spectrum consisting of intensity values at corresponding mass-to-charge ratios (m/z).Analysis of small molecules by electrospray ionization (ESI)-MS, one of the most widely used MS techniques, results in mostly singly-charged ions.In the case of proteins or other biomolecules, which have much higher molecular weights, charge state envelopes are formed from ions at different charge states, but originate from the same molecule.The isotopes of the elements present in the protein and its adducts change the isotope peak pattern for each peak, forming Gaussian-type profiles.Due to the complexity of such spectra, maximum entropy deconvolution [2,3] as a pre-processing step facilitates the analysis of proteins reconstituting the charge state envelope for each species detected into neutral mass peaks.
MS has proven particularly valuable in characterizing metallodrug interactions with proteins, e.g., proteinmetal complex stoichiometry, adduct composition, binding sites, and structural changes [4][5][6][7][8][9][10][11][12].For current metallodrugs to progress toward clinical development, it is crucial to understand the pharmacological properties, notably metallodrug-protein interactions [13][14][15].However, interpreting mass spectra is typically done manually, which can be time-consuming, tedious, and error-prone due to the complexity of mass spectra, in particular for reactive species that can undergo changes not only upon interacting with proteins, but also by reaction with matrix components or during the analysis process with solvent molecules.
Software solutions have been explored to automatize the identification of protein adducts, but for example, Analysis of Protein Modifications from Mass Spectra ( Apm 2 s) [16] is targeted at proteomics workflows, pyOpenMS [17] is a mass spectrometry-based proteomics analysis tool but not specifically designed for identifying protein-metal complex adducts.The Nesvizhskii lab and collaborators have created a suite of software 1 for proteomics and metabolomics applications [18][19][20][21][22] which are well supported for these applications.mMass [23] and pyQms [24] are either again focused on proteomics or metabolomics, and limited to a narrow set of inputs, or have had no further development and support in recent years.Therefore, AdductHunter is introduced here as a webbased tool that automates the identification of proteinmetal complex adducts in deconvoluted mass spectra, which, to the best of our knowledge, is the first tool of its kind for this purpose.

Implementation
AdductHunter is a web-based tool that automates the identification of protein adducts in deconvoluted mass spectra (see Fig. 1).It requires a series of input files and parameters, returning a downloadable output file that contains a list of (feasible) species corresponding to different peaks in the input spectrum (see Fig. 2 for its general algorithm).These species are sorted by their similarity to the experimental peaks as scored by closeness of fit (loss) to isotope pattern and mass error.

Input files
Three input files are required; (1) the deconvoluted mass spectrum, for example obtained using Maximum Entropy Deconvolution in Bruker DataAnalysis to produce a charge neutral spectrum [2,3]; (2) a file that lists the protein and any atoms, ions, and solvents contained in the sample and their corresponding constraints, such as charge and coordination number, the number of expected adducts formed; and (3) a description of the standard adducts involved in the sample and their corresponding constraints, which is expected to be very similar for most experiments.These are required to be of .xlsxor .csvfile types and assumed to have the correct formatting (see Tables 1, 2, 3 and Additional Files 1 and 2 for examples of the expected layout for these input files).

Peak identification
AdductHunter begins by identifying peaks within the mass spectrum.Users are required to specify three parameters involved in this process: (1) the (normalized) noise threshold or minimum peak height; (2) the minimum distance between adjacent peaks in atomic mass units; and (3) whether a linear re-calibration of the spectrum is required using known peaks as internal standards.In the case that a re-calibration is applied,  all mass-to-charge values are shifted equally, either according to the difference between the (theoretical) peak isotopic mass of the protein and the closest identified isotopologue peak in the mass spectrum, or a user-specified value.With these parameters set, peaks are identified in a two-step process.The spectrum intensities are first normalized to the most abundant peak, then peaks exceeding the minimum height threshold are identified using SciPy's peak detection function, which yielded similar results to several recently reported MS peak detection algorithms [31][32][33].Higherresolution MS, however, picks up a much greater number of low intensity peaks, leading to more peaks having an intensity larger than the noise threshold and a significant number of false positive peaks.As a result, a second filtering step was included in the peak identification process.Filtering uses the minimum distance between peaks to remove peaks belonging to the same species, ensuring isotope peaks within the same isotope pattern are only detected once, a feature increasingly relevant in mass spectra collected with higher resolution instruments (see Fig. 3).Additionally, users can specify to only return detected peaks with at least one feasible species in the output.
The peak isotopic mass refers to the highest nominal mass peak by intensity-weighted average of the hyperfine mass distribution (at each integer mass) of a species.The most abundant isotopologue typically matches that of commercial isotope pattern predictors on the scale of 10 −3 parts per million (ppm), e.g., the Bruker isotope pattern generator [34].These mass values are later used in the constraint optimization formulation to linearly approximate the true mass value of a species.

Optimization problem
Once peaks within the mass spectrum have been detected, AdductHunter will determine their corresponding speciations by formulating an optimization problem, involving an objective function subject to a set of constraints, at each identified peak p.The objective function measures the dissimilarity (distance) between the theoretical isotope pattern of a given species and the experimental isotope distribution of the peak.The constraints are established from user-defined parameters  An empty value means there is no limit, which is expected here as these adducts do not coordinate to the metal centre The charge needs to be provided as positive or negative integers Species Formula Min Max M Charge and files, forming the set of feasible solutions.In the context of this problem, a feasible solution refers to a combination of input compounds that gives a potential species matching the peak-centred experimental isotopic distribution.This gives the general formulation: where x is the vector of the number of molecules for each compound, φ(x, p) is the dissimilarity between species x and the experimental distribution around peak p, and F(p) is the set of feasible species at peak p.For example, to represent UbPt( NH 3 ) 2 , we would have Due to noise and inaccuracies in the collection and averaging of mass spectra, the true species may not be optimal, that is, there exists another species that has an isotope pattern more similar to the experimental isotope distribution.However, the correct species is highly likely to be contained within the feasible set, if sensible constraints and parameters have been provided.Thus, returning all feasible species is helpful for postoptimization validation and analysis.

Constraint integer optimization formulation
A constraint integer optimization (CIO) formulation is a type of integer optimization formulation where all feasible integer solutions are returned.The formulation takes advantage of the problem structure and constraints to ensure sensible species are generated.Although once thought to be intractable, it has shown great advances in efficiency and speed in recent years, and can be solved quickly using industry-grade solvers such as CPLEX [35] (1) and GUROBI [36], much faster than enumerating all possible solutions.
To start, the decision variables, x i , are defined as the number of molecules present for protein/adduct i, each having a mass of m i , for every i in the set of all species, C. The mass value here takes into account the charge discrepancy from adding a metal-based fragment, c i , of spe- cies i by removing the mass of c i protons from its peak isotopic mass, that is, where P i is the most abundant isotopologue of the species.
Constants/parameters in the formulation are defined in either the web application or the compound constraint files.The user-defined parameters in the web application are as follows: i.The peak tolerance, t, defined as the neighbourhood of mass values around a peak p, at which a combination of species forms a feasible protein adduct.It is enforced by the constraint: ii.The maximum number of unique standard adducts, r, in any feasible solution.We define standard adducts as those adducts frequently observed in ESI-MS, that is, the alkali metal ions Na+, Li+ and K+, as well as H+.To enforce this, the indicator variables d s = {standard adduct s is selected} , to track which standard adducts have been selected, will need to be added with the following constraint: where S is the set of all standard adducts and iii.The minimum, g, and maximum, h, number of proteins in a multi-protein use case, in any feasible solution.Again, the indicator variables z a = {primary a is selected} , to track which primaries have been selected, will need to be added with the following constraint: where A is the set of all primaries.
(2) iv.The coordination number, v, of metal k used.The coordination number for linear complexes is 2, for square planar and tetrahedral complexes is 4, and for octahedral complexes is 6, to name a few.For cisplatin, the platinum (II) metal center has v = 4 .AdductHunter supports one type of metal at a time.This constraint is enforced by: For the compound description/constraint files (see Tables 2, 3), the user-defined parameters are as follows: v.The lower and upper bounds l i and u i , respectively, for species i in any feasible solution, enforced by constraints: vi.The maximum number of coordinating species, n j , per metal k for each binding species j, enforced by the constraint: where B is the set of all binding compounds.Putting all of this together, along with non-negativity constraints, the final CIO formulation defining the feasible set F(p) at peak p is: As an example, we illustrate in the system involving ubiquitin incubated with cisplatin the CIO formulation defining the feasible set at the peak corresponding to a mass of 8774.6028Da: (7) i∈C\{k,S} where the set of all species C = Ubiquitin, Platinum, Ammonia, . . ., Potassium , the set of binding compounds B = Ammonia, Water, Chlorine , the set of standard adducts S = Lithium, Sodium, Potassium , the peak tol- erance t = 2 , the maximum number of unique standard adducts r = 2 , the metal k is Platinum with a coordina- tion number v = 4 , and the maximum number of coordi- nating species is n j = 2 for all binding compounds j ∈ B .Notice that since there is only one protein in this system, we do not have the multi-protein constraint.

Objective function
With the constraints established, we require an objective function that measures the similarity in shape and mass between theoretical and experimental isotope distributions, or the dissimilarity assuming a minimization problem.Furthermore, the effects of preceding and succeeding noisy peaks far from the peak for the most abundant isotopologue should be ignored, as well as intensities below a certain height due to the noise in high-resolution data-these do not help the measurement of similarity.Thus, only values within a certain user-specified interval of the current peak are considered when comparing the theoretical and experimental distributions.AdductHunter uses Dynamic Time Warping (DTW) [37] to find the dissimilarity between distributions, that is, φ(x, p) is the (Euclidean) distance between the optimally aligned theoretical and experimental isotopic distributions.DTW works by computing a distance matrix between the two isotopic distributions, where each cell in the matrix represents the distance between a specific point in one distribution and a specific point in the other distribution.The optimal path through the distance matrix that minimizes the total distance between the two distributions is then computed by constructing a cost matrix that accumulates the distances between all possible pairs of points in the two distributions.The cost matrix is then traversed in a way that minimizes the total (11) 8772.6028 accumulated cost along the path, that is, the optimally aligned dissimilarity between the two distributions.

Output table
After the optimization problem is solved, a table of feasible protein adducts with an indication of the closest fit for each peak is returned (see Table 4).The table is sorted by experimental peak mass and closeness of fit (loss).Here, the theoretical peak mass is recorded as the most abundant isotopologue for a given species and is used to calculate mass error in ppm.

Results and discussion
We examined the performance of AdductHunter on a variety of datasets to understand its effectiveness in accurately identifying protein adducts, as well as discuss here its limitations and further development.
AdductHunter was specifically developed to identify adducts formed between metal complexes and proteins.A collection of 22 unique datasets was analyzed to provide a comprehensive performance benchmark for AdductHunter (see Table 5).The metal complexes used were cisplatin, oxaliplatin, RAPTA-C, RM-175, Au-1, Au-2, and Au-3, with formulas cis-Pt(NH  .Each data set contained a mixture of at least one protein and one metal complex (see Table 5).We compared the output from AdductHunter for each dataset against the corresponding ground truth, that is, the manually identified protein adducts.

Peak identification
Peak detection in mass spectra is subject to identifying many false positives, especially at low intensities where noise is prevalent.Here, we define false positives as  Au-2 CyC, Ub qTOF 23 [41] Au-3 CyC, Ub qTOF 32 [41] peaks detected by the tool but not ground truth peaks, and false negatives as ground truth peaks that were not picked up by the tool.The peak detection algorithm in AdductHunter is highly sensitive to the normalized minimum peak height.A lower minimum peak height allows AdductHunter to detect more manually identified peaks, although with diminishing returns and increasing false positives (see Fig. 4).Through testing and assuming an equal weight on false positives and false negatives, a value of 0.01 was found to be optimal; decreasing the setting to 0.005 added many false positives with few manually identified peaks, likely due to noise, and increasing the value to 0.02 removed a notable portion of manually identified peaks with a less significant reduction in false positives.Another notable parameter in peak detection is the minimum distance between two (manually identified) adjacent peaks, found to be 15.9 Da over all datasets and set to 15 as a default.
The other significant parameter to be defined is the tolerance around peaks, t.Peak tolerance makes strides at accounting for noise in mass spectra and error in the mass approximation of the adduct in AdductHunter.Here, individual compound masses are summed instead of finding the most abundant isotopologue for the adduct, which is a non-linear, non-continuous, and computationally-expensive calculation.Consequently, we decided to keep the formulation linear as it is a close approximation of the true mass value.This parameter has the most flexibility, uncertainty, and, alongside the minimum peak height, is where computational efficiency in the constraint optimization is most affected.
Since t is directly proportional to the size of the feasible set, a large enough t is desired to be confident that as many manually identified species are captured in the feasible set, but not so large that numerous unwanted species are made feasible (see Fig. 5).Recall that only peaks with at least one feasible solution are returned in the output (and the best-fit species is found for each peak).As a result, a larger t will not only return more potential species, but more unique peaks as well; this brings about the detection of peaks as false positives that would not have been returned with a lower tolerance.Tolerance values were selected to be slightly larger than multiples of the atomic mass of a hydrogen atom at 1.008, that is, nH where n ∈ N , which has been approximated to n + 0.1 .It was found that for the given data and tolerances greater than 3.1, no more peaks in the manually identified were returned, meaning the missing number of manually identified peaks did not change.Hence, increasing the tolerance past this point means new peaks returned are all false positives.

Default parameters
The results of the benchmarking tests were used to set the default parameters values for AdductHunter.As a broader range of data is tested and analyzed, a parameter search would prove useful to precisely determine their optimal values.Parameter values could also be made variable and dependent on the mass.The assignment of proton adducts became more challenging for higher adducts with larger masses, as they tended to be further away from the experimental peak, making reliable identification difficult.In the used datasets, higher mass adducts at lower intensity in the mass spectra and the peaks usually were surrounded by increased noise and complexity, which comes naturally with more individual components involved in each adduct.Thus, for peaks at Fig. 5 Missing proportion of species which improves as the tolerance is increased, albeit with diminishing returns in accuracy higher masses, parameters may need to accommodate for an increased feasible set to capture the previously identified peaks in the ground truth.For example, the peak tolerance could increase as the mass of the protein adducts increases, with possibly a smaller starting peak tolerance than the constant mass tolerance (3.1) used as mass error increases with adduction complexity.Future work may also include automatically calculating the noise threshold as a function of the baseline intensity and noise level of the spectrum, instead of being a user-defined input.

Objective function analysis
A variety of established similarity measures for the objective function were tested over all datasets to determine which metric would work best.We used the "similaritymeasures" package [38] to test the following measures: the area between curves [38], Partial Curve Matching [39], discrete Fréchet distance [40], and Dynamic Time Warping [37] with Euclidean and City Block (Manhattan) distance measures.The normalized intensity values were also scaled by a range of weights -0 (effectively only using mass), 10 −1 , 10 0 , 10 1 , 10 2 , 10 3 -to understand its significance in finding the best fit.Dynamic Time Warping with an Euclidean distance measure was found to have the best average performance with a weight of 10 −1 on the intensity.However, other (tested) similarity metrics and weights may be used depending on the data (see Table 6 and Additional File 3).As noise affects the experimental intensities, more weight is applied to mass accuracy when comparing experimental and theoretical distributions.Using only mass however, performs poorly due to multiple feasible solutions having similar mass values.
A different type of objective function initially considered was to measure the mass error (ppm) at each peak p, which is a scaled form of the relative error between the peaks in the theoretical isotope pattern and experimental isotope distribution: where T (x) is the theoretical peak mass of species x.
Parts per million error calculations are a common approach in MS analyses [4], and would be substantially easier to implement and interpret than a distance metric.However, the linear mass approximation used to find theoretical mass peaks means that ppm would need to be measured after finding the feasible set to accurately calculate its value (as the isotope pattern is needed to find its peak, which is a non-linear process), hence it is unusable as an objective in the constraint formulation.Furthermore, using a full isotope pattern is more robust as there are cases where two consecutive isotope peaks have near identical abundance in the experimental spectrum, and so measured and theoretical distributions may disagree on the identity of the tallest peak, resulting in large ppm values.

Running time
Experiments were run using an Intel Core i5-8250U CPU and 8GB RAM.When calculating the objective, the total time taken for the AdductHunter analysis of a recorded spectrum was dominated by the generation of hyperfine isotopic mass distributions.In contrast, the choice of objective function had a negligible effect on the total analysis time.Across all datasets, generating the hyperfine isotope distribution took approximately 135.5 s on average.The time required to identify peaks and ( 12) generate the set of feasible species pales in comparison, taking approximately 0.41 s on average.Additionally, an approximate, coarse method for generating isotopic mass distributions exists in pyOpenMS that is significantly faster ( ∼ 85 times) than generating the hyperfine peaks, which took approximately 1.68 s on average.However, the mass values calculated using the coarse method will not accurately reflect the most abundant isotopologue peak as a simplified formula is used to find isotope peaks with greater mass [17].This imprecision leads to decreased accuracy and high error values for almost all metrics at an intensity weight of 10 -1 , although some improvements can be seen for larger intensity weights across all metrics (see Table 7).The best performance (mean accuracy of 0.787) was achieved with the Dynamic Time Warping with City Block (Manhattan) distance measures at an intensity weight of 10 0 .However, it is worse than than the one achieved with the hyperfine method (mean accuracy of 0.842, see Table 6).Hence, we recommend to use the hyperfine method, although the coarse method may be used for rapid preliminary testing.As species and peaks generated are independent of each other, further improvement on the analysis time would involve parallelizing the generation of isotope patterns, constraint integer optimization formulations, and objective function calculations.

Conclusion
AdductHunter was created to identify protein-metal complex adducts in deconvoluted mass spectrometry data by formulating a constraint integer optimization problem at each experimental mass peak and using dynamic time warping to find the best fit species based on its theoretical isotopic distribution.The results presented herein provide comprehensive evidence that AdductHunter effectively detects peaks within mass spectrometry data and accurately determines their speciation much faster than interpreting the spectra manually.Efforts are currently underway to address AdductHunter's limitations, specifically by introducing the deconvolution of experimental mass spectra as well as ensuring that it can appropriately handle samples with more than one metal complex in the incubation mixture.

Mass tolerance: 2 . 1 Fig. 1
Fig. 1 Webpage layout showing input files and parameters for peak identification and the constraint optimization formulation

Fig. 2
Fig. 2 Overview of the underlying algorithm to AdductHunter

Fig. 3
Fig.3Peaks identified in mass spectra recorded for ubiquitin (Ub) and cisplatin mixtures at low, medium, and high resolution

Table 1
Input file for a deconvoluted mass spectrum in the mass range of 8000-11,000 recorded for a mixture of Ub and cisplatin containing mass, m, and intensity, I, valuesIntensities can be of any unit as they will be normalized relative to the most abundant peak

Table 2
Species description and constraints input table for a sample containing the protein ubiquitin (Ub) and cisplatin Min, Max indicate the minimal and maximal values for each component in any identified peak M refers to the maximum number of the corresponding coordinating species per metal An empty value means there is no limit, which is expected here as these adducts do not coordinate to the metal centreThe charge needs to be provided as positive or negative integers

Table 4
Truncated output table for a spectrum recorded for a Ub/cisplatin containing feasible solutions and their corresponding measuresThe columns PO, Mass, and Peak, refer to the proton offset, theoretical nominal mass peak, and experimental mass peak, respectively

Table 6
Mean accuracy across all datasets using the hyperfine isotope generator for tested metrics at different weights on the intensityThe Partial Curve Matching measure normalizes both mass and intensity inputs to the same scale, meaning different non-zero weights on the intensity have the same effect.Dynamic Time Warping-E and Dynamic Time Warping-CB refer to the Dynamic Time Warping with Euclidean and City Block (Manhattan) distance measures, respectivelyThe best value for each metric is in italics and the overall best is in bold